Homework 1: Introduction to Optimization and Least Squares
Instructions
Please submit a .qmd file along with a rendered pdf to the Brightspace page for this assignment. You may use whatever language you like within your qmd file, I recommend python, julia, or R.
Problem 1: Gradient Descent
- Consider the mathematical function defined on \(f: \mathbb{R}^2\,\to\, \mathbb{R}\):
\[ f(x,y) = (x-1)^2 + (y+2)^2, \]
Find the single critical point of this function and show that it is a local minimum (in this case, this will also be a global minimum).
- Now consider a new objective function that depends on a parameter \(b\): \[ f(x,y) = x^2 + by^2 \] Here we will look at two different values of \(b\), \(b=3\) and \(b=10\). The global minimum of this function occurs at the point \(x^* = 0\), \(y^*=0\) no matter what the value of \(b\). Suppose that we didn’t know this and wanted to find the minimum of this function using gradient descent instead of direct calculation.
- First write code to perform the gradient descent algorithm, that is perform the iteration: \[ \mathbf{v}_{n+1} = \mathbf{v}_n - k \nabla f(\mathbf{v}_n), \]
where the vector \(\mathbf{v} = \begin{bmatrix} x & y\end{bmatrix}^T\) and \(k\) is the learning rate.
- Then test the performance of your algorithm as a function of the learning rates \(k\) by performing 100 iterations of the algorithm for 100 values of \(k\) equally spaced between \(k=0.01\) and \(k=0.3\). Start with an initial guess of \(\mathbf{v}_0 = \begin{bmatrix} b & 1\end{bmatrix}^T\). Do this for \(b=3\) and \(b=10\). Make separate plots for \(b=3\) and \(b=10\) of the log base 10 of the error (in this case it is \(\sqrt{x_{100}^2+y_{100}^2}\)) for the final value of the iteration versus the value of \(k\). How does learning rate relate to the final value of the error? For which value of \(b\) does the algorithm have the ability to converge fastest (have the lowest value of the error at the end)?
Note: For some combinations of \(k\) and \(b\), the algorithm won’t converge to the right answer, i.e. the error will grow with time. To make your plot easier to read, don’t plot the error for iterations that didn’t converge.
- As \(k\) increases, for one or both values of \(b\), you will observe a point where the trend of final error versus learning rate reverses direction. Pick a value of \(k\) very close to the point where this occurs, and make a contour plot of the function \(f\) and the trajectory of the iterations for the gradient descent algorithm for that value of \(k\) superimposed over the contour plot. What do you observe?
Note: The differences that you observe here are a special case of a more general phenomenon: the speed of convergence of gradient descent depends on something called the condition number of the Hessian matrix (the matrix of the 2nd order partial derivatives) of the target function. The condition number for a symmetric matrix is just the ratio of the largest to smallest eigenvalues, in this case the condition number is \(b\) (or 1/\(b\)). Gradient descent performs worse and worse the larger the condition number (and large condition numbers are problematic for a wide variety of other numerical methods).
Problem 2: Solving Least Squares Problems
Generate a random \(20\times 10\) matrix \(A\) and a random 20-vector \(b\) (use a Gaussian distribution). Then, solve the least squares problem: \[ \min_{\mathbf{x}\in \mathbb{R}^{10}} \|A\mathbf{x} - \mathbf{b}\|^2 \] in the following ways:
Multiply \(\mathbf{b}\) by the Morse-Penrose Pseudoinverse \(A^+\).
Use built in functions to solve the least squares problem (i.e. in python numpy.lstsq, in R lm, and in Julia the backslash operator).
Using the \(QR\) factorization of \(A\). This factorization rewrites \(A\) as: \[ A = \begin{bmatrix} Q & 0\end{bmatrix} \begin{bmatrix} R & 0 \end{bmatrix}^T, \] where \(Q\) is an orthonormal matrix and \(R\) is upper triangular. The least squares solution equals: \[ \mathbf{x} = R^{-1}Q^T\mathbf{b} \]
Verify that each of these solutions are nearly equal and that the residuals \(A\mathbf{x}-\mathbf{b}\) are orthogonal to the vector \(A\mathbf{x}\)
Problem 3: Iterative Solutions to Least Squares
Although the pseudoinverse provides an exact formula for the least squares solutions, there are some situations in which using the exact solution is computationally difficult, particularly when the matrix \(A\) and vector \(\mathbf{b}\) have a large number of entries. In this case, \(AA^T\), which is an \(m\times m\) matrix if \(A\) is \(m\times n\), may require an enormous amount of memory. In these cases it may be better to use an approximate solution instead of the exact formula. There are many different approximate methods for solving least squares problems, here we will use an iterative method developed by Richardson.
This method begins with an initial guess \(\mathbf{x}^{(0)} = 0\) and calculates successive approximations as follows:
\[ \mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \mu A^T\left(A\mathbf{x}^{(k)}-\mathbf{b}\right) \]
Here \(\mu\) is a positive paramter that has a similar interpretation to the learning rate for gradient descent. A choice that guarantees convergence is \(\mu \leq \frac{1}{\|A\|}\). The iteration is terminated when the change in the residual \(\|A^T(Ax^{(k)} − b)\|\) after successive steps is below a user determined threshold, which indicates that the least squares optimality conditions are nearly satisfied.
- Suppose that \(\mathbf{x}\) is a solution to the least squares problem: \[ \mathbf{x} = A^+\mathbf{b} \]
Show by substitution of the formula for the pseudoinverse that \(\mathbf{x}\) is a of the iteration scheme, i.e. that: \[ \mathbf{x} = \mathbf{x} - \mu A^T\left(A\mathbf{x}-\mathbf{b}\right) \]
- Generate a random 20 × 10 matrix \(A\) and 20-vector \(\mathbf{b}\), and compute the least squares solution \(\mathbf{x} = A^+\mathbf{b}\). Then run the Richardson algorithm with \(\mu = \frac{1}{\|A\|^2}\) for 500 iterations, and plot \(\|\mathbf{x}^{(k)}-\mathbf{x}\|\) to verify that \(\mathbf{x}^{(k)}\) is converging to \(\mathbf{x}\)